Content-free Document Genre Classification using First Order Random Graphs

Authors

  • Andrew D. Bagdanov
  • Marcel Worring
Abstract

We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
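The pairing of ARGs (one per document instance) with FORGs (one per genre) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the discrete zone labels and spatial relations, the hypothetical genre names, the `floor` smoothing value, and above all the premise that ARG vertices are already aligned with FORG vertices. The abstract does not describe the actual attributes, graph matching, or parameter estimation. The sketch only shows the first order idea of scoring a graph as a product of independent per-vertex and per-edge probabilities.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ARG:
    """Attributed relational graph for one page (illustrative attributes only)."""
    nodes: dict = field(default_factory=dict)  # vertex id -> zone label, e.g. "title"
    edges: dict = field(default_factory=dict)  # (id, id) -> spatial relation, e.g. "above"


@dataclass
class FORG:
    """Toy first order random graph: one independent discrete
    distribution per vertex and per edge (an assumed simplification)."""
    node_probs: dict  # vertex id -> {label: probability}
    edge_probs: dict  # (id, id) -> {relation: probability}

    def log_likelihood(self, arg, floor=1e-6):
        # Assumes ARG vertex ids are already aligned with FORG vertex ids.
        # Unseen labels or vertices fall back to the small `floor` probability.
        total = 0.0
        for v, label in arg.nodes.items():
            total += math.log(self.node_probs.get(v, {}).get(label, floor))
        for e, rel in arg.edges.items():
            total += math.log(self.edge_probs.get(e, {}).get(rel, floor))
        return total


def classify(arg, models):
    """Return the genre whose FORG gives the page the highest log-likelihood."""
    return max(models, key=lambda genre: models[genre].log_likelihood(arg))


# Hypothetical page: a title zone above a body zone.
page = ARG(nodes={0: "title", 1: "body"}, edges={(0, 1): "above"})

# Hypothetical genre models with made-up probabilities.
models = {
    "article": FORG(
        {0: {"title": 0.9, "figure": 0.1}, 1: {"body": 0.8, "figure": 0.2}},
        {(0, 1): {"above": 0.9, "left_of": 0.1}},
    ),
    "brochure": FORG(
        {0: {"figure": 0.7, "title": 0.3}, 1: {"body": 0.5, "figure": 0.5}},
        {(0, 1): {"above": 0.5, "left_of": 0.5}},
    ),
}

print(classify(page, models))  # article
```

In this toy setting the page scores 0.9 x 0.8 x 0.9 = 0.648 under the "article" model and 0.3 x 0.5 x 0.5 = 0.075 under the "brochure" model, so the first genre is chosen. A real system would also need to find the vertex correspondence rather than assume it.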


Similar resources

Fine-Grained Document Genre Classification Using First Order Random Graphs

We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our met...


Genre Classification of Web Documents

Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in m...


Searching in document images: what does the appearance of a document tell us about what it means?

The document understanding problem can be informally defined as the automatic extraction of meaning from documents. In the Intelligent Sensory Information Systems group we have experimented with analyzing the visual appearance of documents in order to extract meaning. That is, we concentrate on how documents look, rather than on what they say. We motivate this approach with several applications...


Thesis Stereotyping the Web: Genre Classification of Web Documents

Retrieving relevant documents over the Web is a difficult task. Currently, search engines rely on keywords for matching documents to user queries. This paper explores the potential for discriminating documents based on the genre of the document. I define genre as a taxonomy that incorporates the style, form and content of a d...


Classification of document page images based on visual similarity of layout structures

Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify a document's type in the absence of domain-specific models. A document type or genre can be defined by the user based primarily on layout structure. Our classification approach is based o...



Publication date: 2001